Data heterogeneity in federated learning with Electronic Health Records: Case studies of risk prediction for acute kidney injury and sepsis diseases in critical care

With the wider availability of healthcare data such as Electronic Health Records (EHR), more and more data-driven approaches have been proposed to improve the quality-of-care delivery. Predictive modeling, which aims at building computational models for predicting clinical risk, is a popular research topic in healthcare analytics. However, concerns about the privacy of healthcare data may hinder the development of effective, generalizable predictive models, because such models often require rich, diverse data from multiple clinical institutions. Recently, federated learning (FL) has demonstrated promise in addressing this concern. However, data heterogeneity across local participating sites may affect the prediction performance of federated models. Because acute kidney injury (AKI) and sepsis are highly prevalent among patients admitted to intensive care units (ICU), AI-based early prediction of these conditions is an important topic in critical care medicine. In this study, we take AKI and sepsis onset risk prediction in the ICU as two examples to explore the impact of data heterogeneity in the FL framework and to compare performance across frameworks. We built predictive models based on local, pooled, and FL frameworks using EHR data across multiple hospitals. The local framework used only data from each site itself. The pooled framework combined data from all sites. In the FL framework, each local site did not have access to other sites’ data. Each site updated a model locally and shared its parameters with a central aggregator, which updated the federated model’s parameters and then shared them back with each site. We found that models built within the FL framework outperformed their local counterparts. Then, we analyzed variable importance discrepancies across sites and frameworks. Finally, we explored potential sources of heterogeneity within the EHR data. The different distributions of demographic profiles, medication use, and site information contributed to data heterogeneity.

Introduction
Acute kidney injury (AKI) and sepsis are two potentially life-threatening clinical conditions that complicate treatment and clinical trajectories and potentially worsen outcomes for a significant number of hospitalized or intensive care unit (ICU) patients [1][2]. For patients with AKI or sepsis, morbidity and mortality are usually higher than for patients without AKI or sepsis, with as much as a sevenfold increase in mortality risk, regardless of the type of ICU (for example, medical, surgical, or cardiac) [3][4]. Moreover, healthcare utilization within the ICU is often higher for patients with these conditions. For example, patients with AKI and sepsis often require hemodialysis, inotropic support, or mechanical ventilation [5]. Therefore, early prediction of AKI or sepsis risk in critical care settings can facilitate early interventions that are likely to provide benefit, including aggressive treatment with fluid resuscitation and antimicrobials that may improve patient outcomes [6].
Recently, due to wider availability of electronic health record (EHR) data and advances in artificial intelligence (AI), machine learning (ML) based disease risk prediction has attracted more attention in the ICU setting [7]. Previous studies on AKI and sepsis onset risk prediction mainly focused on building a predictive model on medical data from single hospitals [8][9][10][11][12][13]. However, building an accurate and generalizable disease risk prediction model requires a large amount of data from a diverse patient population [8]. Collecting the data together from different hospitals and constructing a unified risk prediction model on the combined data can lead to better prediction performance. Moreover, using multiple hospitals or sites data over single institution data can add to the generalizability of ML models [14]. A recent study has shown that creating more generalizable models can increase algorithmic fairness, yet many published models lack this generalizability across geographic locations and demographics [15]. However, due to the highly sensitive nature of EHR in terms of protected health information (PHI) of patients, aggregating multiple institutions' data all together is challenging [16].
More recently, federated learning (FL) has emerged as a promising strategy for building ML models with fragmented sensitive data [17]. FL is a mechanism for training ML models across multiple decentralized sites that hold local data samples, without exchanging those samples [18]. It uses a central aggregator to obtain the global ML model's parameters by iteratively exchanging model parameters with the local ML models. However, data heterogeneity in the FL framework may affect prediction performance [19]. For example, different hospitals serve different populations and may vary widely in patient treatment, such as the medications they administer and the procedures they conduct. This heterogeneity especially affects the performance of sepsis and AKI prediction models, which rely on patient demographics, disease history, and medications [20]. Both AKI and sepsis are also highly heterogeneous [21]. This makes models built with conventional FL strategies such as federated averaging difficult to generalize across clinics, limiting their use [7,22,23]. Several federated architectures have been proposed in other domains to mitigate the effects of data heterogeneity by building personalized, but globally correlated, models that mitigate drift across sites [23], such as model-agnostic meta-learning (MAML), federated multitask learning, and knowledge distillation [24][25][26][27][28]. However, it is not clear how such data heterogeneity will affect risk prediction models in clinical medicine.
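The parameter exchange described above can be illustrated with a minimal sketch of federated averaging. This is not the study's implementation: the logistic-regression model, the synthetic single-feature sites, the learning rate, the number of rounds, and the sample-size weighting in `fed_avg` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent on one site's data; return new weights."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
            for j, xj in enumerate(x):
                grad[j] += err * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w

def fed_avg(site_weights, site_sizes):
    """Aggregate site parameters as a sample-size-weighted average (assumed weighting)."""
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
            for j in range(dim)]

def make_site(n, shift, rng):
    """Synthetic site: one feature plus a bias term; `shift` mimics covariate heterogeneity."""
    data = []
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        y = 1 if rng.random() < sigmoid(2.0 * x - 1.0) else 0
        data.append(([x, 1.0], y))
    return data

rng = random.Random(0)
sites = [make_site(200, 0.0, rng), make_site(100, 1.0, rng), make_site(50, -1.0, rng)]

global_w = [0.0, 0.0]
for _ in range(30):  # communication rounds
    # Each site trains locally from the current global parameters; raw data never leaves the site.
    local_ws = [local_update(global_w, d) for d in sites]
    # Only parameters reach the aggregator, which sends the average back to every site.
    global_w = fed_avg(local_ws, [len(d) for d in sites])

print(global_w)
```

Because plain averaging treats every site's update as an estimate of the same underlying model, it has no mechanism for handling sites whose data distributions differ, which is the limitation this study examines.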
To fill this research gap, we comprehensively investigate the effects of data heterogeneity in the FL framework for predicting the onset risk of AKI and sepsis in the ICU setting using EHR data from multiple hospital sites. We built multiple predictive models in local, pooled, and FL settings. The local setting built an individual model for each site from its own data. The pooled setting built a global model shared across all sites with their combined data. The FL setting also built a global model, where each local site did not share data with others, but updated model parameters locally and shared the updated parameters with a central aggregator, which updated the global model parameters and shared them back with each site. By comparing the performance of models trained in the different settings, we investigated how data heterogeneity would affect the federated risk prediction models. We also explored the potential sources of heterogeneity within EHR data by analyzing predictor importance across settings and sites. The differences were contrasted according to patient and hospital information to elucidate sources of heterogeneity and how they would potentially affect the different predictive modeling settings. The overall workflow of our study is shown in Fig 1. The notable contributions of this work to the literature are as follows:
• In the context of AKI and sepsis onset risk prediction in the ICU setting, a comprehensive comparison of prediction performance among local, pooled, and federated settings was conducted with a set of ML models.
• We have identified important predictors for AKI and sepsis risk and performed an exhaustive analysis of how they would affect the prediction results. These predictors can be used by medical specialists to monitor the risk of AKI and sepsis for patients in the ICU, while accounting for the specifics of their own hospitals. In addition, we have delineated differences in feature importance across medical sites, outlining metrics for direct comparison of feature importance across different settings (i.e., local, pooled, and federated).
• We have performed a thorough analysis of the potential sources of heterogeneity between hospital sites according to patient demographic, medication, and lab data, as well as hospital information such as available unit types. We outline how these sources of heterogeneity could be connected to the varying predictor importance derived across sites and settings.

Clinical interpretation of sepsis and AKI prediction models
In the AKI 24h setting, the pooled MLP model identified last measured level of creatinine (creatinine_last), last measured hematocrit level (hematocrit_last), Furosemide, bg_paco2_min, maximum potassium level (potassium_max), minimum creatinine level (creatinine_min), last measured systolic blood pressure (sysbp_last), hemoglobin_first, minimum bicarbonate level (bicarbonate_min), and last measured calcium level (calcium_last) as the top 10 most important variables. All factors except Furosemide were lab tests and vital signs. The pooled LR model shared several important factors with the pooled MLP model, with the addition of age, first measured calcium level (calcium_first), and last measured blood urea nitrogen level (bun_last). Of note, in the pooled MLP model, a creatinine_last of ~4 mg/dL is associated with an exp(0.4) ≈ 1.5-fold increase in risk of AKI 24h. In the pooled LR model, creatinine_last shows a similarly strong relationship to AKI 24h risk as in the pooled MLP model. A bun_last measurement of ~60 mg/dL is associated with an exp(0.2) ≈ 1.2-fold increase in risk of AKI 24h. In the pooled LR model, the risk of AKI 24h given administration of furosemide is greater than in the MLP model, with an odds ratio of exp(0.1) ≈ 1.1.
In the AKI 48h setting, the pooled MLP model identified creatinine_last, hemoglobin_first, bg_paco2_first, potassium_min, bun_last, Furosemide, maximum partial pressure of carbon dioxide (bg_paco2_max), hemoglobin_max, sysbp_last, and first measured platelet count (platelet_first) as the top 10 most important variables. All factors except Furosemide were lab tests and vital signs. The pooled LR model shared several important factors with the pooled MLP model, with the addition of the mean systolic and diastolic blood pressure (meanbp_first) and minimum glucose level (glucose_min). Like the 24h setting, in the 48h pooled MLP model, a creatinine_last of ~4 mg/dL is associated with an exp(0.4) ≈ 1.5-fold increase in risk of AKI. A bun_last measurement of greater than ~25 mg/dL is associated with an increased risk of AKI 48h. In the pooled LR model, creatinine_last and bun_last show similarly strong relationships as in the pooled MLP model. Furosemide is considered an important medication across all AKI settings and model architectures.

[Figure: Each panel shows the marginal effect of one of the most impactful features ranked among the top 10 for predicting AKI 24h or 48h using pooled models. The x-axis gives the raw values of each feature, and the y-axis gives the logarithm of the estimated odds ratio (i.e., the SHAP value) for sepsis, AKI 24h, or AKI 48h when a feature takes a certain value. Each dot represents the SHAP value of a sample. The LOWESS curve, which smooths across all the dots, is plotted in red in all panels.]
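The fold-change figures quoted in this section, such as exp(0.4) ≈ 1.5, come from reading a SHAP value as an additive contribution on the log-odds scale. The sketch below reproduces that arithmetic. The `linear_shap` helper uses the closed form for a linear model with independent features, which applies to LR but not to the MLP, and the coefficient and feature mean in the last lines are hypothetical values chosen only so that the contribution comes out near 0.4.

```python
import math

def odds_multiplier(shap_value):
    """Convert a log-odds contribution (SHAP value) into a multiplicative change in odds."""
    return math.exp(shap_value)

def linear_shap(weight, value, feature_mean):
    """SHAP value of one feature in a linear log-odds model, assuming independent features."""
    return weight * (value - feature_mean)

# SHAP values quoted in the text for creatinine_last, bun_last, Furosemide, and electivesurgery.
for shap_value in (0.4, 0.2, 0.1, 0.02):
    print(f"SHAP {shap_value:+.2f} -> odds multiplied by {odds_multiplier(shap_value):.2f}")

# Hypothetical LR coefficient and cohort mean for creatinine_last (illustration only).
phi = linear_shap(weight=0.16, value=4.0, feature_mean=1.5)
print(f"contribution {phi:.2f} -> odds multiplied by {odds_multiplier(phi):.2f}")
```

Strictly, these multipliers apply to the odds rather than to the probability of AKI; the two are close only when the outcome is rare.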
For the AKI 24h setting, the federated MLP and LR models consider more medications important than their respective pooled counterparts. Medications considered important by the federated MLP model include Furosemide, Potassium Chloride, Aspirin, and Metoprolol, whereas the federated LR model considered Insulin important as well. Interestingly, the federated MLP model considered the patient's choice of elective surgery (electivesurgery) as an important feature, albeit with a relatively small increase (exp(0.02) ≈ 1.02-fold) in risk of AKI 24h. Like the 24h setting, the federated MLP and LR models of the AKI 48h setting considered more medications important than their respective pooled counterparts. Both the MLP and LR models consider administration of Aspirin and Insulin as important factors. The federated MLP for the 48h setting uniquely finds the minimum ratio of "partial pressure of oxygen" to "fractional inspired oxygen" (bg_pao2fio2ratio_min) and maximum level of glucose (glucose_max) as important factors. Local models shared numerous important factors with pooled and federated models, depicting similar relationships between feature value and risk of sepsis/AKI (S4 Fig).

[Figure: Each panel shows the marginal effect of one of the most impactful features ranked among the top 10 for predicting AKI 24h or 48h using federated models. The x-axis gives the raw values of each feature, and the y-axis gives the logarithm of the estimated odds ratio (i.e., the SHAP value) for sepsis, AKI 24h, or AKI 48h when a feature takes a certain value. Each dot represents the SHAP value of a sample. The LOWESS curve, which smooths across all the dots, is plotted in red in all panels.]

PLOS DIGITAL HEALTH

Source of prediction performance heterogeneity across model architectures, frameworks, and sites
To better understand differences in feature importance across hospital sites and model frameworks, we performed a qualitative analysis of the most important variables selected by the models and their prevalence across sites.

[Figure: Each dot corresponds to one of the most important features ranked among the top 100 by at least one of the seven models; the y-axis measures the proportion of sites that identified the feature as top-100, or "commonality across sites"; the x-axis measures the mean of the feature importance rankings, measured as a "soft ranking" (the closer it is to 1, the higher the feature ranks). Top-100 is an arbitrary cutoff we used to analyze the most important features and illustrate heterogeneity. Each feature is also color coded by the interquartile range (IQR) of its ranks across sites (the higher the IQR, the more the sites disagree on the importance of that feature). https://doi.org/10.1371/journal.pdig.0000117.g005]

[...] their importance rankings in AKI prediction models, where the y-axis is the proportion of sites that consider the feature a top-100 feature (for the specific model architecture). For example, a feature with a y-value of 1.0 is deemed important at all sites, whereas a feature with a y-value of 0.1429 (1/7) is considered important at only one site. The x-axis shows the importance ranking of the feature, averaged across the sites at which it is considered important (i.e., in the top 100); a feature is more important the closer its ranking is to 1. Results for the sepsis prediction setting are available in the Supplemental Information and S8 Fig. [...] Chloride, among others, have different trends with the prediction diagnosis depending on the site.

Fig 6 shows the distribution of important features for the pooled models across sites. The pooled models in both the AKI 24h and 48h settings have relatively fewer features that are important at only a small number of sites compared to the local model framework. For the pooled MLP and LR AKI 24h or 48h models, the universally important features mainly included creatinine_last, potassium_max, and creatinine_min. All AKI pooled models have features that are uniquely important to the pooled models (i.e., these features were not among the top 100 features at any local site). The pooled MLP AKI 24h model uniquely considered administration of Nitroglycerin moderately important. The pooled LR AKI 24h model uniquely considered administration of Metoclopramide, Mupirocin, Lidocaine, and race_black as slightly important. The pooled MLP AKI 48h model uniquely considered administration of Hydromorphone and bilirubin_last as slightly and moderately important, respectively. The pooled LR AKI 48h model also uniquely considered administration of Hydromorphone and bilirubin_last as moderately important. Taken together, these differences suggest that there is slight variability in uniquely important features among models.

Fig 7 shows the distribution of important features for the federated models across sites. Like the pooled models, both MLP and LR federated models have relatively fewer features that are important at only a small number of sites compared to the local model analysis. The federated MLP AKI 24h model shares its universally important features with its pooled counterpart, namely attributing importance to creatinine_last, potassium_max, and creatinine_min, among others. These features are universally important in the federated LR architecture, as well as in the 48h setting. Some of the federated AKI models have uniquely important features as well. The federated MLP AKI 24h model considered administration of Dexmedetomidine as an important variable. The federated LR AKI 48h model considered administration of Phenylephrine slightly important. As in the pooled setting, we can see discrepancies in the importance of features that are not considered universally important at all sites.
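The per-feature quantities plotted in Figs 5, 6, and 7 (the proportion of sites that place a feature in the top k, its mean rank over those sites, and the spread of its ranks) can be computed as in the following sketch. It is an illustration under stated assumptions: the three toy sites and their rankings are invented, plain integer ranks stand in for the paper's "soft ranking" (whose definition is not given in this section), and `site_agreement` is a hypothetical helper name.

```python
from statistics import mean, quantiles

def site_agreement(rankings, top_k):
    """Summarize cross-site agreement on feature importance.

    rankings: {site_id: [feature names ordered from most to least important]}
    Returns {feature: {"commonality", "mean_rank", "rank_iqr"}} for every feature
    that appears in the top_k list of at least one site.
    """
    n_sites = len(rankings)
    ranks = {}
    for ordered in rankings.values():
        for position, feature in enumerate(ordered[:top_k], start=1):
            ranks.setdefault(feature, []).append(position)

    summary = {}
    for feature, r in ranks.items():
        if len(r) >= 2:
            q1, _, q3 = quantiles(r, n=4)
            iqr = q3 - q1
        else:
            iqr = 0.0  # a single site gives no spread to measure
        summary[feature] = {
            "commonality": len(r) / n_sites,  # y-axis: share of sites ranking it in the top k
            "mean_rank": mean(r),             # x-axis: averaged only over those sites
            "rank_iqr": iqr,                  # color: disagreement across sites
        }
    return summary

# Hypothetical per-site rankings (the study uses seven sites and a top-100 cutoff).
toy_rankings = {
    "A": ["creatinine_last", "potassium_max", "aspirin"],
    "B": ["creatinine_last", "furosemide", "potassium_max"],
    "C": ["potassium_max", "creatinine_last", "insulin"],
}
summary = site_agreement(toy_rankings, top_k=2)
for feature, stats in summary.items():
    print(feature, stats)
```

With seven sites, a feature ranked in the top k at a single site would get a commonality of 1/7 ≈ 0.1429, matching the example given in the text.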

Correlation of feature importances across model architectures
To investigate differences in feature importance between model architectures, we looked at the correlation between the importance rankings of features shared by both the MLP and LR models for each setting and framework. Fig 8 shows these correlations, where the x and y axes are the importance of the feature in the MLP and LR model, respectively. In the AKI 24h setting, local models have moderately strong positive correlations, with Pearson correlation coefficients (PC) ranging from 0.79 to 0.84. The pooled and federated AKI 24h models show slightly weaker positive correlations than the local models (PC = 0.79 and 0.77, respectively). These results suggest that, within the AKI setting, the pooled and federated models were not successful at decreasing feature discrepancies between the LR and MLP architectures that were present in local models. Sepsis prediction setting results are available in the Supplemental Information and S9 Fig.
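As a rough illustration of the comparison above, the sketch below computes a Pearson correlation over the importance ranks of the features that two architectures share. The rank dictionaries are made-up examples, and `shared_rank_correlation` is a hypothetical helper, not the study's code.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)

def shared_rank_correlation(rank_a, rank_b):
    """Correlate importance ranks over the features present in both models."""
    shared = sorted(set(rank_a) & set(rank_b))
    return pearson([rank_a[f] for f in shared], [rank_b[f] for f in shared])

# Hypothetical importance ranks (1 = most important) for one setting and framework.
mlp_ranks = {"creatinine_last": 1, "bun_last": 2, "furosemide": 3, "age": 4}
lr_ranks = {"creatinine_last": 1, "bun_last": 3, "furosemide": 2, "calcium_first": 4}

pc = shared_rank_correlation(mlp_ranks, lr_ranks)
print(f"PC over shared features: {pc:.2f}")
```

Features that only one architecture ranks (here `age` and `calcium_first`) drop out of the comparison, so the coefficient describes agreement on the shared features only.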

Correlation between local feature importances and non-local framework feature importances
To investigate the correlation of heterogeneous features between local frameworks and both pooled and federated frameworks, we established the 'unique importance score' (UIS). The UIS is large for features that are highly important at a small subset of sites, whereas it is small for features that are considered universally important (i.e., features that were important at a plurality of sites). In other words, the score is large for features that lie in the bottom right region of the plots in Figs 5, 6, and 7. The calculation of the UIS can be found in the Methods section. Fig 9 shows the correlations of the UIS across frameworks. Similar conclusions can be derived from both pooled and federated framework analyses, in both sepsis and AKI. There is a strong positive correlation (PC ranging from 0.84 to 0.93) between the local UIS and the pooled/federated UIS. Interestingly, for all analyses, confidence in the line of best fit decreases at larger UIS values. This suggests that features considered universally important in the local framework were important for the pooled/federated models, whereas features important at only a small subset of hospitals were disregarded.

Tables 1 and 2 show demographic profiles across each site for AKI and sepsis patients, respectively. For both AKI and sepsis settings, sites show similar gender distributions, with a slight majority of patients being male across all sites. Age distributions are also similar across all sites, with most patients being between 50 and 75 years old. Patient BMI is similar across sites, with most patients having a BMI between 23 and 34. Site 199 has slightly fewer patients with a BMI of less than 23, and more patients with a BMI of greater than 34, compared to other sites. In both settings, there was a disparity in the number of patients that underwent elective surgery, with the proportions ranging from 0.12 to 0.28. Patients show differences in racial breakdowns across sites. The African American population varies across sites from 0.02/0.01 (AKI/sepsis) at Site 199 to 0.3/0.32 at Site 243. Site 73 has a relatively large population of Hispanic individuals compared to other sites, whereas Sites 122, 243, 252, and 458 have no Hispanic patients. The Asian population is similar across all sites. The 'Other' racial category has the largest proportion of individuals across all sites, but this proportion varies widely depending on the site, ranging from 0.67 to 0.98. As previously mentioned, most of the patients in all settings (AKI 24h, AKI 48h, and sepsis) were negative for the disease. For the AKI 24h and AKI 48h settings, the proportion of AKI-positive patients ranges from 0.06/0.08 (24h/48h) to 0.1/0.13. For the sepsis setting, the proportion of positive patients ranges from 0.02 to 0.20. Table 3 shows general site information for the 7 hospitals. The sites are located across the Northeast, Midwest, and South of the continental United States of America. All sites are large, with more than 500 beds. There are differences in patient unit types across all sites.

used at multiple hospitals, the proportion of patients that were on the medication at each hospital varied greatly. Coupled with the disparities in unit types, this suggests that each hospital site treats significantly different populations of individuals, despite all these hospitals having patients who suffer from AKI and sepsis.

Discussion
In this study, to investigate the effect of data heterogeneity on the performance of FL, multiple machine learning models were developed to predict the risk of both AKI and sepsis in multiple ICU settings. Different types of EHR data, including lab tests, vital signs, demographics, and medications, were extracted from seven hospitals in the eICU Collaborative Research Database. Three model frameworks, local, pooled, and federated, were explored. Effects of data heterogeneity across hospital sites were evaluated through model performance comparison and feature importance analysis. The sources of data heterogeneity across hospitals were investigated based on patient demographics, medication usage, and general hospital attributes. Our prediction models have shown performance comparable with state-of-the-art AKI and sepsis prediction studies [7]. In addition, federated model frameworks generally outperform their local counterparts in our results. However, this largely depends on how heterogeneous the patient populations from different hospitals are. Moreover, the pooled model did not show much improvement over the local models, which could be largely due to cross-site sample heterogeneity. Though FL performed better than pooled models in our investigations, our FL strategy is based on federated averaging, which does not account for such cross-site heterogeneity; thus, it is difficult to justify the generalizability of this conclusion. One reason why the federated models performed better than the pooled models might be their weight distribution. The weights of the federated models are more concentrated around zero than those of the pooled models in all settings. More weights near zero means that the models are more regularized and simpler, and such models are likely to generalize better [29].
The performance heterogeneity of predictive models across sites and frameworks was evaluated by comparing feature importance.
For both AKI and sepsis prediction tasks, important variables identified by predictive models were consistent with prior studies [7]. For example, creatinine and furosemide exposure showed positive associations with AKI, which is unsurprising given their clinical association with AKI. For the same model architecture, the importance of a feature varied depending on the site, with variable-prediction relationships changing (see Figs 5, 6, and 7). The presence of 'universally important features' (i.e., features that were considered highly important at most sites) and 'uniquely important features' (i.e., features that were highly important at a small subset of sites) showed that there was disagreement on relative importance across sites. The feature heterogeneity plots for federated and pooled frameworks showed fewer uniquely important features. This indicated that both of these frameworks were able to attribute higher importance to features shared across multiple sites.
Our findings also demonstrated that federated and pooled models were not successful at decreasing feature importance discrepancies between LR and MLP architectures, and that both pooled and federated frameworks prioritized features that were considered important across a plurality of sites (i.e., low UIS) and attributed lower importance to features that were uniquely important at a small subset of sites (i.e., high UIS). These findings also suggest that the federated model may be better at discriminating the key features of patient-level clinical, lab, and demographic information that improves risk prediction. In the field of critical care medicine, the implication of this finding is that across heterogeneous sources of data, federated models are more likely to highlight the common elements that can better predict sepsis and AKI between hospital, patient, and practice-specific circumstances, thus highlighting the generalizability of the model's value. However, consequently, it is possible that important local characteristics that may better predict AKI and sepsis within hospitals could be overlooked when compared to pooled or local models, which may in turn limit the clinical utility of these tools, a finding that is increasingly being acknowledged in the AI/ML literature.
Within our analysis, differences in the features of the hospitals and ICUs were notable. Many sites had no patients admitted to ICU types that accounted for a high proportion of patients at other sites, for example the Medical Surgery ICU or SICU. That is, different sites treat different conditions. Thus, treatments may vary depending on the etiology and nature of the condition driving sepsis and AKI [30]. For example, a patient managed for decompensated heart failure in a cardiac ICU who subsequently develops AKI may be treated with inotropic support and furosemide, whereas a patient being managed for septic shock in a medical ICU with AKI may be aggressively repleted with intravenous fluids. As such, different hospitals, which specialize in treating different conditions, may have slightly differing medication regimens for patients with the same disease, which in turn may be a function of the practicing physician and their choice of treatment options, including medications, prespecified protocols, or even higher-level decisions about cost within centralized hospital pharmacies. Our models highlighted this putative disagreement in medication usage between hospitals, which creates another source of heterogeneity in model training. This heterogeneity in medications and demographic details was demonstrated in the feature heterogeneity plots, since features with higher UIS values tend to be medications and demographic information. In contrast, lab tests and vital signs were generally universally important features across hospitals, likely because these are commonly standardized across hospitals. Taken together, local frameworks may suffer heavily in generalizability even when population demographics are similar across sites, due to disagreements in medication and treatment administration. However, clinicians may find use in site-specific factors, which may not be evident in federated frameworks and may only be ascertainable within a local framework. As such, while federated frameworks may provide performance increases, local frameworks can still provide clinical value in determining important site-specific factors for risk prediction.

Limitations
There are several limitations to our study. First, we mainly considered structured clinical information to construct the features. Integrating unstructured free text into the predictive models may yield better predictive performance and allow a new level of explainability of predictions. Second, we only considered LR and MLP to build predictive models in the local, pooled, and federated frameworks. Other algorithmic solutions, such as support vector machines, may have the potential to improve model performance. Third, we mainly focused on describing the effects of data heterogeneity in FL in terms of disease risk prediction. Data harmonization techniques and other federated techniques that mitigate the problem and improve performance are a topic for future research. Moreover, federated techniques that deal with data heterogeneity while simultaneously reducing communication costs may be required for real-time medical use.

Ethics statement
This study analyzed a publicly available anonymized database (eICU Collaborative Research Database) with preexisting institutional review board approval. Collection of data was in accordance with the ethical standards set out by the IRB no. 0403000206 of the Massachusetts Institute of Technology and with the 1964 Declaration of Helsinki and its later amendments. Because the database is fully anonymized, formal consent was not required to use the data.

Data aggregation
Patient data was extracted from the eICU Collaborative Research Database, a multi-center critical care database made publicly available through Philips Healthcare and the MIT Laboratory for Computational Physiology (https://eicu-crd.mit.edu/). The database contains detailed information regarding the clinical care of ICU patients. We investigated three disease settings: AKI with a 24h or 48h observation window (OW), and sepsis. An AKI (non-graded) is defined as any of the following: an increase in serum creatinine (SCr) by ≥ 0.3 mg/dl (≥ 26.5 μmol/l) within 48 hours, an increase in SCr to ≥ 1.5 times baseline that is known or presumed to have occurred within the prior 7 days, or urine volume < 0.5 ml/kg/h for 6 hours. We predict AKI risk using an accumulating OW (S1 Fig). We predicted AKI within the next 24 hours (prediction window, PW) following the end of the OW, focusing only on the first 3 days (72 hours) of a patient's inpatient hospital stay (max OW = 48 hours). For each patient, we created two OW/PW pairs, using OW = 1-24 hours (1 day) and OW = 1-48 hours (2 days) after admission. We do not consider the onset point. For the AKI prediction experimental setting, positive cases are samples that are diagnosed as AKI in the prediction window, whereas controls are samples that are not diagnosed as AKI in the prediction window. For sepsis prediction, we labeled patient data in accordance with Sepsis-3 clinical criteria. We predicted whether patients would suffer from sepsis 6 hours prior to onset, onset point included. For the sepsis prediction experimental setting, positive cases are samples that are diagnosed as sepsis. Controls are samples that are not diagnosed as sepsis. For patients who did not develop sepsis, predictor values were selected from a random T-hour time window (T is usually set as 48 or 24 hours) during the patient's ICU stay. For those who developed sepsis, a time was selected for the patient between admission and 6 hours prior to the onset of sepsis, and the predictor values were extracted. Data was collected from seven hospitals with the following IDs: 420, 122, 243, 252, 458, 199, and 73. For all three disease predictions (24h or 48h AKI, and sepsis), all hospital sites shared all features, including general demographic information (8 variables), vital signs/lab tests (29 variables), and medications (254 medications). For 28 vital signs and lab tests, the max, min, first, and last values are calculated. For urine, only the summation is calculated. Taken together, a total of 354 features were available at every hospital site for each patient.
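The three-part AKI definition given above can be written as a small labeling function. The sketch below is a simplified reading under stated assumptions: the input layout (timestamped creatinine values and hourly urine rates) and the function name are hypothetical, the baseline creatinine is taken as given, and the 7-day timing condition on the 1.5-times-baseline criterion is not checked.

```python
def meets_aki_criteria(scr, baseline_scr, urine_ml_per_kg_h):
    """Return True if any of the three AKI criteria from the text is met.

    scr: list of (hour, serum creatinine in mg/dl), sorted by hour
    baseline_scr: baseline serum creatinine in mg/dl (assumed known)
    urine_ml_per_kg_h: urine output rates for consecutive hours
    """
    # Criterion 1: SCr rises by >= 0.3 mg/dl between two measurements at most 48 h apart.
    for i, (t0, v0) in enumerate(scr):
        for t1, v1 in scr[i + 1:]:
            # Rounding guards against floating-point noise at the 0.3 boundary.
            if t1 - t0 <= 48 and round(v1 - v0, 6) >= 0.3:
                return True

    # Criterion 2: SCr reaches >= 1.5 times baseline (7-day timing not checked here).
    if any(v >= 1.5 * baseline_scr for _, v in scr):
        return True

    # Criterion 3: urine output below 0.5 ml/kg/h for 6 consecutive hours.
    run = 0
    for rate in urine_ml_per_kg_h:
        run = run + 1 if rate < 0.5 else 0
        if run >= 6:
            return True
    return False

# Hypothetical patient: creatinine rises from 1.0 to 1.3 mg/dl within 24 hours.
print(meets_aki_criteria([(0, 1.0), (24, 1.3)], baseline_scr=1.0, urine_ml_per_kg_h=[]))
```

In the prediction setting described in the text, a check of this kind would be applied to measurements falling inside the prediction window to decide whether a sample is a positive case or a control.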

Data processing
For all datasets, we performed an automated curation process outlined as follows. (1) We systematically identified extreme values of numerical features (e.g., vital signs/lab tests and some demographic information) that fell below the 1st or above the 99th percentile as outliers, and marked these values as missing. Primarily, this step marked values within demographic data (BMI, age) and some vital signs as missing. Values marked as missing were checked against the clinical literature to confirm that they were physiologically impossible. Previous studies utilizing the eICU Collaborative Research Database have noted that these errors occur at random and can be removed in downstream analyses [31][32]. (2) We standardized all variables by normalizing the numerical features and converting binary features to either 1 or -1. (3) For all missing measurements, the Multiple Imputation by Chained Equations (MICE) algorithm was used. MICE estimates missing values by taking advantage of the relationships between non-missing measurements within the dataset. Because overall patient distributions are conserved after outlier removal (only a limited number of values are considered outliers), MICE imputation can provide robust estimates of these values as well [33].
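Steps (1) and (2) of the curation process can be sketched as follows. This is a minimal illustration using NumPy, not the authors' pipeline: the function names are hypothetical, "normalizing" is assumed here to mean z-scoring, and binary features are assumed to be coded 0/1 before conversion. The MICE imputation of step (3) is not shown.

```python
import numpy as np

def mark_outliers(col, lower=1.0, upper=99.0):
    """Copy a numeric column and set values outside the percentile range to NaN."""
    col = np.asarray(col, dtype=float).copy()
    lo, hi = np.nanpercentile(col, [lower, upper])
    col[(col < lo) | (col > hi)] = np.nan
    return col

def standardize(col):
    """Z-score a numeric column, ignoring missing values (assumed normalization)."""
    col = np.asarray(col, dtype=float)
    return (col - np.nanmean(col)) / np.nanstd(col)

def encode_binary(col):
    """Map a 0/1 indicator column to -1/1."""
    return np.where(np.asarray(col) == 1, 1, -1)
```

Values set to NaN by `mark_outliers` would then be passed to the imputation step together with any measurements that were missing in the raw data.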

Experimental design
There were three prediction tasks: 24-hour AKI prediction, 48-hour AKI prediction, and sepsis prediction. Three model frameworks were designed: local, pooled, and federated. The local model framework only used data from each site itself. The pooled model framework combined data from all sites. In the federated model framework, each local site did not have access to other sites' data. A model was trained locally, and its parameters were shared with a central aggregator, which used them to update the global model parameters; these were subsequently sent back to each site. For each framework, logistic regression (LR) and multilayer perceptron (MLP) were used as model architectures, so 54 tasks were performed in total (7 site-specific (local) models x 3 prediction tasks x 2 architectures + 1 pooled model x 3 prediction tasks x 2 architectures + 1 federated model x 3 prediction tasks x 2 architectures). For all settings, five-fold cross-validation was used during model training. The SHapley Additive exPlanations (SHAP) tool was used to calculate feature importance rankings for each task. Markov Chain Type 4 rank aggregation was used to combine the feature importance rankings from all five folds.

Learning algorithm
To investigate the effects of heterogeneity across architectures, we focused on two learning models: multilayer perceptron (MLP) and logistic regression (LR). The MLP is a class of feedforward artificial neural network (ANN) with a non-parametric functional form [34]. An MLP consists of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Except for the input nodes, each node is a neuron that uses a nonlinear activation function. MLPs are trained with a supervised learning technique called backpropagation. Multiple layers and nonlinear activations distinguish an MLP from a linear perceptron and allow it to separate data that are not linearly separable. Since MLPs are fully connected, each node in one layer connects with a certain weight to every node in the following layer [35]. The MLP model was implemented with Python's PyTorch library, an open-source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing and primarily developed by Facebook's AI Research lab [36]. All MLP models had one hidden dense layer of 10 units and were trained with a learning rate of 0.001, binary cross-entropy loss, and stochastic gradient descent optimization. To mitigate class imbalance, class weights were used to penalize the loss for positive class inaccuracies. This allows the model to pay increased attention to examples from the positive class despite a skewed class distribution [37]. Each model was trained for 200 epochs with a batch size of 64. An epoch is one complete pass of all the training data through the model, whereas the batch size is the number of samples processed in each iteration before the model is updated [38].
In addition to the MLP model, we implemented an LR model. The LR model has a parametric functional form and formulates the log-odds of an event as a linear combination of independent variables [39]. The LR model consists of one linear layer followed by a sigmoid activation. As with the MLP, a learning rate of 0.001, binary cross-entropy loss, and stochastic gradient descent optimization were used. Class weights were applied in the same fashion as for the MLP model. For consistency and to enable direct comparisons, all models of each framework for all tasks were built with the same architecture.
Because the output of an MLP model is a nonlinear function of the inputs, the decision boundary for classification from an MLP is also nonlinear, which provides more flexibility than LR models [34]. As such, we wanted to investigate the effects of heterogeneity across these two different architectures.
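The two architectures and their shared training settings can be sketched in PyTorch as below. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names are hypothetical, ReLU is assumed as the hidden activation (the text only specifies a nonlinear activation), the sigmoid is folded into `BCEWithLogitsLoss`, and the positive-class weight is assumed to be the negative-to-positive ratio.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

N_FEATURES = 354  # feature count stated in the data description

def build_mlp(n_features=N_FEATURES):
    """One hidden dense layer of 10 units; outputs a logit."""
    return nn.Sequential(nn.Linear(n_features, 10), nn.ReLU(), nn.Linear(10, 1))

def build_lr(n_features=N_FEATURES):
    """Logistic regression: one linear layer; sigmoid is applied inside the loss."""
    return nn.Sequential(nn.Linear(n_features, 1))

def train(model, X, y, epochs=200, batch_size=64, lr=0.001):
    """Mini-batch SGD on class-weighted binary cross-entropy."""
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    # Assumed weighting: up-weight positives by the negative/positive ratio.
    pos_weight = (n_neg / n_pos.clamp(min=1)).reshape(1)
    loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(xb).squeeze(1), yb)
            loss.backward()
            optimizer.step()
    return model
```

Keeping the loss, optimizer, and learning rate identical for both builders mirrors the stated goal of comparing architectures under otherwise equal settings.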
Our primary model framework of interest was a federated learning model. In this model, training was performed at different sites, and parameters were shared with a central location. To create a federated model using both the MLP and LR architectures, the federated averaging technique was used. The process was as follows: a central aggregator initialized the federated model with random parameters. This model was sent to each site, where it was trained for one epoch. Next, model parameters were sent back to the central aggregator, where federated averaging was performed. Updated parameters from the central aggregator were then sent back to each site, and this cycle was repeated for multiple epochs. Federated averaging scales the parameters of each site according to the number of available data points and sums all parameters by layer. Through this technique, federated models did not receive any raw data. Class weighting was performed at each site on every cycle, which ensured local data distribution information was not sent to the global server. All parameters for local server models were kept the same to enable comparison. We were able to perform federated class weighting through this mechanism because local data distributions were similar across hospitals.
Assume $M$ local sites, each with $N_m$ samples (the number of samples at the $m$-th local site). $w_n^{(m)}$ is the weight of the $n$-th sample at the $m$-th site, and $y_n^{(m)}$ is the ground-truth label for sample $x_n^{(m)}$, which is the $n$-th sample at the $m$-th site. The base model O is initialized at the global server. Without loss of generality, the following steps assume an LR model described by Eq (1), where $\hat{y}_n$ is the predicted value for sample $x_n$ by the LR model with parameters $\beta$:

$$\hat{y}_n = \frac{1}{1 + \exp\left(-\beta^{\top} x_n\right)} \tag{1}$$

Each site $m$ trains the model on its local data to obtain parameters $\beta_m$. All $\beta_m$ are transferred to the global server, where the layer weights of all $\beta_m$ are averaged through Eq (3), generating an updated global model for the next epoch:

$$\beta = \sum_{m=1}^{M} \frac{N_m}{\sum_{k=1}^{M} N_k}\, \beta_m \tag{3}$$

The time complexity of one iteration of federated averaging is $O(Z_m N_m)$ for client $m$, where $Z_m$ is the number of parameters in the model. The communication cost of one iteration is $O(Z_m)$.
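The federated averaging cycle described above can be sketched for the LR case as follows. This is a minimal NumPy illustration, not the study's PyTorch implementation: the function names are hypothetical, the intercept is assumed to be absorbed into the feature vector, and the positive-class weight is assumed to be the local negative-to-positive ratio computed at each site.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def local_epoch(beta, X, y, rng, lr=0.001, batch_size=64):
    """One epoch of mini-batch SGD on class-weighted binary cross-entropy."""
    beta = beta.copy()
    pos_w = (len(y) - y.sum()) / max(y.sum(), 1)  # local class weight (assumed form)
    w = np.where(y == 1, pos_w, 1.0)
    idx = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        b = idx[start:start + batch_size]
        err = w[b] * (sigmoid(X[b] @ beta) - y[b])  # gradient of weighted BCE w.r.t. the logit
        beta -= lr * X[b].T @ err / len(b)
    return beta

def federated_average(betas, sizes):
    """Average site parameters, weighting each site by its sample count."""
    return np.average(np.stack(betas), axis=0, weights=np.asarray(sizes, dtype=float))

def fedavg(sites, n_features, rounds=5, seed=0):
    """sites is a list of (X, y) pairs; only parameters leave each site."""
    rng = np.random.default_rng(seed)
    beta = rng.normal(scale=0.01, size=n_features)  # random initialization at the aggregator
    for _ in range(rounds):
        local = [local_epoch(beta, X, y, rng) for X, y in sites]
        beta = federated_average(local, [len(y) for _, y in sites])
    return beta
```

Because the class weight is computed inside `local_epoch`, no label distribution is sent to the aggregator, which matches the weighting scheme described in the text.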

Evaluations
We used the area under the receiver operating characteristic curve (AUROC), which is known to be more robust to imbalanced datasets, to compare overall prediction performance. In addition to AUROC, accuracy, precision, and recall were calculated. Beyond aggregate performance metrics for each model, training loss and training/testing AUROC histories were recorded. Tests of significance were performed using Student's t-test. Feature importance rankings for each task were computed using SHAP. To focus on the most impactful features (i.e., variables ranked among the top 100) without losing information on the weaker features, we assigned each feature a "soft" ranking that reflects how high its rank is relative to the top features (s = 100) by applying an exponentially decreasing function to the original ranks (r), i.e., f(r) = exp{−r/s}. For some top features, SHAP dependence plots were generated to illustrate the effect that each feature has on the predictions made by the model. Locally Weighted Scatterplot Smoothing (LOWESS) was used to fit a smooth trend line to the dependence plots.
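The soft ranking f(r) = exp{−r/s} can be sketched as below. This is an illustrative helper with a hypothetical name; it assumes that features are ranked by a per-feature importance value (for example, mean absolute SHAP value) and that ranks start at 1 for the most important feature.

```python
import numpy as np

def soft_ranks(importance, s=100):
    """Convert importance values to soft rankings exp(-r/s), with r = 1 for the top feature."""
    importance = np.asarray(importance, dtype=float)
    order = np.argsort(-importance, kind="stable")    # indices from most to least important
    ranks = np.empty(len(importance), dtype=int)
    ranks[order] = np.arange(1, len(importance) + 1)  # assumed 1-based ranks
    return np.exp(-ranks / s)
```

With s = 100, the top-ranked feature receives a soft ranking close to 1, while features far outside the top 100 decay toward 0 instead of being discarded.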
The unique importance score (UIS) was calculated for each model architecture-setting-framework combination. For local model analysis, the mean importance $i_{lj}$ for each feature $j$ was calculated by averaging all soft rankings for that feature across all sites. This was done for all top 100 features at each local site. For both pooled and federated analysis, the importance ($i_{pj}$ or $i_{fj}$) of each feature $j$ was set to the soft ranking of that feature within the pooled or federated model. In all model frameworks, the frequency $f$ of each feature was calculated as the proportion of local sites at which the feature was a top 100 feature. The UIS was then computed from $i_{lj}$, $i_{pj}$, $i_{fj}$, and $f$.
Shapley dependence plots for federated models. The x-axis gives the raw values of each feature, and the y-axis gives the logarithm of the estimated odds ratio (i.e., the SHAP value) for sepsis, AKI 24h, or AKI 48h when a feature takes a certain value. Each dot represents the SHAP value of a sample. The LOWESS curve, used to smooth the trend across all the dots, is plotted in all panels for each site. (a, c, and e) show Shapley dependence plots for federated MLP models and (b, d, and f) show Shapley dependence plots for federated LR models. (a, b) show plots for sepsis, (c, d) show plots for AKI 24h, and (e, f) show plots for AKI 48h. (TIF)
S5 Fig. Medication usage across local sites. (a, b) show the frequency of medications across hospitals. The x-axis is the number of hospitals and the y-axis is the number of medications. For example, there are ~20 medications that only appear at 1 hospital. (c, d) show disagreement of medication usage across hospitals for medications that appear at 2 or more hospitals. The x-axis shows the standard deviation bins of the proportions of patients using the medication at each hospital (i.e., larger values of standard deviation indicate more disagreement). The y-axis shows the number of medications within the histogram bin.
(d-f) show feature importances for LR models. Each dot corresponds to one of the most important features ranked among the top 100 by at least one of the seven models; the y-axis measures the proportion of sites that identified the feature as top 100, or "commonality across sites"; the x-axis measures the mean of the feature importance rankings, measured as "soft ranking" (the closer it is to 1, the higher the feature ranks). Top 100 is an arbitrary cutoff we used to analyze the most important features to illustrate heterogeneity. In (a, d) each feature is also color coded by the interquartile range (IQR) of the ranks across sites (the higher the IQR, the more disagreement across sites on the importance of that feature).